Support manual resource elastic for allreduce #1714

Open
yifeng-x wants to merge 3 commits into intelligent-machine-learning:master from yifeng-x:allreduce_resource_elastic

@yifeng-x
Contributor

What changes were proposed in this pull request?

Support dynamic manual adjustment of training task resources (CPU/memory/GPU count) at runtime.

Why are the changes needed?

When training fails because of insufficient resources, for example an out-of-memory (OOM) error, the system can automatically adjust the resources and restart training.
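
The adjust-and-restart behaviour described here can be illustrated with a minimal sketch. It is not the code in this PR: `NodeResource`, `OutOfMemory`, `bump_memory`, and `run_with_elastic_restart` are hypothetical names made up for the example, and doubling memory up to a cap is an assumed policy, not one the PR states.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class NodeResource:
    """Hypothetical per-node resource request (not the project's real class)."""
    cpu: float
    memory_mb: int
    gpu_num: int


class OutOfMemory(RuntimeError):
    """Stand-in for an OOM failure reported by a training node."""


def bump_memory(res: NodeResource, factor: float = 2.0,
                cap_mb: int = 65536) -> NodeResource:
    """Return a copy with memory scaled by `factor`, capped at `cap_mb`."""
    return replace(res, memory_mb=min(int(res.memory_mb * factor), cap_mb))


def run_with_elastic_restart(train: Callable[[NodeResource], None],
                             res: NodeResource,
                             max_restarts: int = 3) -> NodeResource:
    """Run `train`; on OOM, enlarge memory and restart, up to `max_restarts`."""
    for _ in range(max_restarts + 1):
        try:
            train(res)
            return res  # training finished with this resource setting
        except OutOfMemory:
            res = bump_memory(res)  # assumed policy: double the memory
    raise RuntimeError("training still fails after resource adjustments")
```

A caller would pass its own training entry point as `train`. CPU and GPU count are left unchanged in this sketch; a manual adjustment could replace those fields in the same way.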

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

yifeng-x and others added 3 commits April 20, 2026 11:19
…-machine-learning#1718)

* stash 20260304

* add link to detail

* add mock page

* add job config

* add job config

* done failover consanguinity

* fix

* rm dashboard service

* optimized html

* optimized

* rm md

* optimized

* optimized

* add dashboard configuration

* update test config

* fix

* fix

* retrigger

* lint

* optimized

* fix ut
@yifeng-x yifeng-x force-pushed the allreduce_resource_elastic branch from e4e1df1 to d015efd Compare April 20, 2026 03:22
Collaborator

@workingloong workingloong left a comment

Good job.


3 participants